Problem Statement

Based on laboratory tests collected from suspected cases, predict the probability that a patient tests positive or negative for COVID-19 and identify the factors that influence the result. Also, provide recommendations to the hospital on how it can better manage the admission of patients to the general ward, semi-intensive unit, or intensive care unit.

Need of the study/project

  1. One motivation for this problem is that, with an overwhelmed health system and limited capacity to test for SARS-CoV-2, testing every case with mild symptoms such as a cold, cough, or mild fever would be impractical. The large number of people with mild symptoms seeking tests would also delay the results. Setting criteria for who gets tested is therefore important for obtaining test results early and for slowing the spread of the virus.

  2. To slow the spread of the virus across the country and to minimize the number of severe hospitalizations.

Understanding business/social opportunity

How can the hospital better manage the admission of patients to the general ward, semi-intensive unit, or intensive care unit? The disease spreads rapidly and a large number of new cases arrive every day, but the number of hospital beds is limited. A careful study is needed to save as many lives as possible.

Importing necessary libraries and data

Converting the xlsx file to a csv file
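A minimal sketch of the conversion step, assuming pandas is used and an Excel engine such as openpyxl is installed. The function name `xlsx_to_csv` and its parameters are hypothetical; the actual file names are not given in these notes.

```python
import pandas as pd


def xlsx_to_csv(xlsx_path, csv_path):
    """Read the first sheet of an Excel workbook and write it out as CSV."""
    df = pd.read_excel(xlsx_path)      # needs an Excel engine, e.g. openpyxl
    df.to_csv(csv_path, index=False)   # drop the row index from the output
    return df


# Usage (paths are placeholders, not the project's real file names):
# data = xlsx_to_csv("input.xlsx", "output.csv")
```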

Data Overview

  1. There are missing values in the dataset.
  2. Seventy-four variables are numerical, so their Python data types (int64 and float64) are appropriate.
  3. The other thirty-seven variables are of object data type.

There are no duplicate values in the dataset.

There are many missing values in a number of columns.
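The overview checks above (data types, duplicates, missing values) could be reproduced roughly as follows. This is a sketch on a toy DataFrame; the values are made up for illustration and only the pandas calls matter.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real dataset (values are illustrative only).
df = pd.DataFrame({
    "Patient age quantile": [3, 17, 8, 12],
    "Hematocrit": [0.2, np.nan, -1.1, np.nan],
    "Influenza B": ["not_detected", None, "detected", "not_detected"],
})

dtype_counts = df.dtypes.value_counts()        # number of columns per data type
n_duplicates = int(df.duplicated().sum())      # count of fully duplicated rows
missing_share = df.isnull().mean().sort_values(ascending=False)  # fraction missing per column

print(dtype_counts)
print("duplicate rows:", n_duplicates)
print(missing_share)
```

Sorting the missing share in descending order puts the sparsest columns at the top, which makes it easier to decide which ones are worth keeping.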

Statistical summary of the data

  1. The mean is slightly greater than the median for 'Patient age quantile', indicating a slight right skew.
  2. Hematocrit is left skewed.
  3. Phosphor is right skewed.
  4. pO2 (arterial blood gas analysis) is right skewed ... and so on
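The skew direction can be read from the mean/median gap or computed directly. A small sketch with made-up columns (`right_tail` and `left_tail` are hypothetical names, not dataset variables):

```python
import pandas as pd

df = pd.DataFrame({
    "right_tail": [1, 1, 2, 2, 3, 10],       # one large value pulls the mean up
    "left_tail": [-10, -3, -2, -2, -1, -1],  # mirror image: mean pulled down
})

summary = pd.DataFrame({
    "mean": df.mean(),
    "median": df.median(),
    "skew": df.skew(),   # sample skewness; > 0 right skewed, < 0 left skewed
})
print(summary)
```

A mean above the median goes with positive skewness (right skew), and a mean below the median with negative skewness (left skew).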

Removal of unwanted variables
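These notes do not state which variables were removed or why. One common criterion, shown here purely as an assumption, is to drop identifier columns and columns that are almost entirely missing. The helper `drop_sparse_columns`, the 0.9 threshold, and the toy column names are all hypothetical.

```python
import numpy as np
import pandas as pd


def drop_sparse_columns(df, max_missing=0.9, extra=()):
    """Drop columns whose missing share exceeds max_missing, plus any listed in extra."""
    sparse = df.columns[df.isnull().mean() > max_missing]
    to_drop = list(sparse) + [c for c in extra if c in df.columns and c not in sparse]
    return df.drop(columns=to_drop)


toy = pd.DataFrame({
    "patient_id": ["a", "b", "c", "d"],                  # identifier, no predictive value
    "Hematocrit": [0.1, np.nan, -0.4, 0.7],              # partly missing, kept
    "mostly_empty": [np.nan, np.nan, np.nan, np.nan],    # fully missing, dropped
})
cleaned = drop_sparse_columns(toy, max_missing=0.9, extra=["patient_id"])
print(cleaned.columns.tolist())
```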

Exploratory Data Analysis (EDA)

data containing only numerical columns

Plotting all the features at one go for numerical variables

  1. Patient age quantile is uniformly distributed.
  2. There are outliers in the distribution of many variables (e.g., Hematocrit, Hemoglobin, Platelets, etc.).
  3. Vitamin B12 has a binary distribution.

Plotting all the features at one go for categorical variables

Data containing only object columns

The bar plots above show the categorical variables with percentage counts.

Bivariate Analysis

  1. The red boxes indicate a strong negative correlation between two variables: if the value of one variable increases, the value of the other decreases, e.g., ionised calcium and pH are negatively correlated.

  2. The blue boxes indicate a strong positive correlation between a pair of variables: if the value of one variable increases, the value of the other also increases, e.g., pO2 and arterial FiO2.

  3. Boxes with values greater than zero indicate a positive correlation between the pair of variables.

  4. Boxes with values smaller than zero indicate a negative correlation between the pair of variables.
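The numbers behind such a heatmap come from a correlation matrix. The sketch below builds one on synthetic columns (`a` to `d` are placeholders, not dataset variables) and ranks the variable pairs by absolute correlation, which is often easier to scan than a large heatmap.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = rng.normal(size=200)
toy = pd.DataFrame({
    "a": x,
    "b": 2 * x + rng.normal(scale=0.1, size=200),   # moves with a
    "c": -x + rng.normal(scale=0.1, size=200),      # moves against a
    "d": rng.normal(size=200),                      # unrelated noise
})

corr = toy.corr()  # Pearson correlation, computed on pairwise-complete rows

# Keep each pair once (upper triangle, diagonal excluded) and rank by |r|.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna().sort_values(key=np.abs, ascending=False)
print(pairs)
```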

How the level of different factors (plotted in the y-axis) varies for covid patients from non-covid patients

Hematocrit level is high for covid patients

Hemoglobin level is high for covid patients

Platelets level is negative for covid patients

Total CO2 (arterial blood gas analysis level) is negative for covid patients

Red blood Cells count is high for covid patients

Lymphocytes count is positive for non-covid patients

Similar plots can be shown for other variables.

The share of patients admitted to the regular ward is high for covid patients.

Leukocytes is negative for covid patients.
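The comparisons above can also be checked numerically by grouping each numerical variable by the test result. This is a sketch on made-up values; the target column name `covid_result` is a placeholder for whatever the dataset calls it.

```python
import pandas as pd

toy = pd.DataFrame({
    "covid_result": ["negative", "negative", "negative",
                     "positive", "positive", "positive"],
    "Platelets": [0.4, 0.1, 0.9, -0.6, -0.2, -1.0],
    "Hematocrit": [-0.3, 0.0, -0.1, 0.5, 0.8, 0.2],
})

# Median of each variable within each class; the sign and size of the gap
# summarise what the box plots show.
medians = toy.groupby("covid_result")[["Platelets", "Hematocrit"]].median()
print(medians)
```

With matplotlib installed, `toy.boxplot(column="Platelets", by="covid_result")` draws the corresponding box plot.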

How the level of different factors (plotted in the X-axis) varies for covid patients from non-covid patients

The count of not-detected Respiratory Syncytial Virus results is high for non-covid patients.

Rhinovirus/Enterovirus is detected in higher numbers among non-covid patients than among covid patients.

Influenza B is not detected in high number for non-covid patients.

Urine - pH level 7 is same for both negative and positive covid patients.

Patients aged up to 19 contract covid less often.
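For categorical variables, a cross-tabulation against the test result gives the same information as the count plots. A sketch on toy data, where `covid_result` is a placeholder name for the target column:

```python
import pandas as pd

toy = pd.DataFrame({
    "covid_result": ["negative"] * 4 + ["positive"] * 2,
    "Rhinovirus/Enterovirus": ["detected", "detected", "not_detected",
                               "not_detected", "not_detected", "not_detected"],
})

# Raw counts per (category, class) cell.
counts = pd.crosstab(toy["Rhinovirus/Enterovirus"], toy["covid_result"])

# Shares within each class, so classes of different size can be compared.
shares = pd.crosstab(toy["Rhinovirus/Enterovirus"], toy["covid_result"],
                     normalize="columns")
print(counts)
print(shares)
```

Normalising by column matters here because non-covid patients far outnumber covid patients, so raw counts alone can be misleading.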

Comparison of the levels of different factors for patients in different wards

Leukocytes

The plot shows that the Leukocytes count is negative for regular ward patients. For serious patients, the Leukocytes count is positive.

Platelets

The platelets count of regular ward patients is negative.

Hematocrit

Hematocrit level is strongly negative for patients in the intensive care unit.

Similar plots can be made for other variables.

EDA Conclusion:

1. The EDA shows how different factors vary between covid and non-covid patients.
2. We have seen how the levels of different factors vary for patients in different wards.
3. Based on these plots we can identify how critical a patient is, which helps save lives and reserve beds for critical patients.

Data Preparation for modeling

Missing-Value Treatment
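The imputation method used is not stated in these notes. The sketch below assumes median imputation with scikit-learn's `SimpleImputer`, fitted on the training split only so that validation and test rows do not leak into the fill values. The data is synthetic and the split sizes are illustrative.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
X[rng.random(X.shape) < 0.3] = np.nan     # knock out roughly 30% of the entries
y = rng.integers(0, 2, size=100)

# Split first: 60% train, 20% validation, 20% test (assumed proportions).
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=1)
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, stratify=y_temp, random_state=1)

imputer = SimpleImputer(strategy="median")
X_train_imp = imputer.fit_transform(X_train)   # medians learned from training rows only
X_val_imp = imputer.transform(X_val)           # reuse the training medians
X_test_imp = imputer.transform(X_test)
```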

Model Building

Model evaluation criterion

Model can make wrong predictions as:

  1. Predicting that a patient is negative for covid19 when the patient is actually positive (false negative).

  2. Predicting that a patient is positive for covid19 when the patient is actually negative (false positive).

Which case is more important?

How to reduce the losses?

Let's define a function to output different metrics (including f1) on the train and test sets, and a function to show the confusion matrix, so that we do not have to repeat the same code while evaluating models.
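A minimal sketch of two such helpers, assuming scikit-learn metrics and labels coded as 0 (negative) and 1 (positive). The function names are hypothetical, and the original helpers may take a fitted model rather than predictions.

```python
import pandas as pd
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)


def metrics_row(y_true, y_pred):
    """Return accuracy, recall, precision and F1 as a one-row DataFrame."""
    return pd.DataFrame({
        "Accuracy": [accuracy_score(y_true, y_pred)],
        "Recall": [recall_score(y_true, y_pred)],
        "Precision": [precision_score(y_true, y_pred)],
        "F1": [f1_score(y_true, y_pred)],
    })


def labelled_confusion_matrix(y_true, y_pred):
    """Confusion matrix with readable row (actual) and column (predicted) labels."""
    cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
    return pd.DataFrame(cm,
                        index=["actual negative", "actual positive"],
                        columns=["predicted negative", "predicted positive"])


y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 0, 0]   # one false negative, one false positive
scores = metrics_row(y_true, y_pred)
cm = labelled_confusion_matrix(y_true, y_pred)
print(scores)
print(cm)
```

In the labelled matrix, the false negatives sit in the "actual positive" row under "predicted negative", and the false positives in the "actual negative" row under "predicted positive".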

Defining scorer to be used for cross-validation and hyperparameter tuning
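Which metric the scorer wraps is not stated here. The sketch below assumes recall, in line with the goal of minimizing false negatives, and uses a synthetic imbalanced dataset in place of the real one.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import make_scorer, recall_score
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Assumption: recall is the metric to optimise; swap in f1_score if preferred.
scorer = make_scorer(recall_score)

# Synthetic, imbalanced stand-in for the training data (about 10% positives).
X, y = make_classification(n_samples=300, n_features=8,
                           weights=[0.9, 0.1], random_state=1)

# Stratified folds keep the class ratio the same in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
scores = cross_val_score(DecisionTreeClassifier(random_state=1),
                         X, y, scoring=scorer, cv=cv)
print(scores, scores.mean())
```

The same `scorer` object can be passed as `scoring=` to the hyperparameter search, so that cross-validation and tuning optimise the same quantity.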

We have seen the performance of the models on the original data. Let's see if oversampling the data can help improve it.

Model Building with oversampled data
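The oversampling technique is not named in these notes (SMOTE from imbalanced-learn is a common choice). As a dependency-light stand-in, the sketch below uses plain random oversampling: minority rows are duplicated with replacement until the classes are balanced. The helper name `random_oversample` is hypothetical, and it should be applied to the training split only.

```python
import numpy as np
from sklearn.utils import resample


def random_oversample(X, y, random_state=0):
    """Duplicate minority-class rows (with replacement) until all classes match the largest."""
    X, y = np.asarray(X), np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    X_parts, y_parts = [], []
    for cls, count in zip(classes, counts):
        X_cls = X[y == cls]
        if count < target:
            extra = resample(X_cls, replace=True, n_samples=target - count,
                             random_state=random_state)
            X_cls = np.vstack([X_cls, extra])
        X_parts.append(X_cls)
        y_parts.append(np.full(len(X_cls), cls))
    return np.vstack(X_parts), np.concatenate(y_parts)


X_train = np.arange(20).reshape(10, 2)
y_train = np.array([0] * 8 + [1] * 2)       # 8 negatives, 2 positives
X_over, y_over = random_oversample(X_train, y_train)
print(np.bincount(y_over))
```

Because the added rows are exact copies, tree-based models can memorise them, which is one reason training scores on oversampled data tend to look better than validation scores.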

Oversampling the data improved performance considerably. Now let's see how the models perform with undersampled data.

Model Building with undersampled data
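Random undersampling can be sketched in a few lines of NumPy: keep all minority rows and a random subset of the majority rows of the same size. The helper name `random_undersample` is hypothetical, and the notes do not say which implementation was actually used.

```python
import numpy as np


def random_undersample(X, y, random_state=0):
    """Keep a random subset of each class, sized to match the smallest class."""
    X, y = np.asarray(X), np.asarray(y)
    rng = np.random.default_rng(random_state)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.min()
    keep = np.concatenate([
        rng.choice(np.flatnonzero(y == cls), size=target, replace=False)
        for cls in classes
    ])
    keep.sort()                      # preserve the original row order
    return X[keep], y[keep]


X_train = np.arange(20).reshape(10, 2)
y_train = np.array([0] * 8 + [1] * 2)       # 8 negatives, 2 positives
X_under, y_under = random_undersample(X_train, y_train)
print(np.bincount(y_under))
```

Undersampling discards majority rows, so the training set shrinks; that loss of data is the price paid for the reduced overfitting noted in the tuning results.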

After looking at the performance of all the models, let's decide which models can be further improved with hyperparameter tuning.

Hyperparameter Tuning
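The tuning steps below all follow the same pattern. A minimal sketch of that pattern, assuming scikit-learn's `RandomizedSearchCV` with a recall scorer and a decision tree; the search method, the parameter ranges, and the synthetic data are illustrative assumptions, not the values used in the project.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import make_scorer, recall_score
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the (over- or undersampled) training data.
X, y = make_classification(n_samples=300, n_features=8,
                           weights=[0.8, 0.2], random_state=1)

# Illustrative search space; limiting depth and leaf size curbs overfitting.
param_distributions = {
    "max_depth": [2, 3, 4, 5],
    "min_samples_leaf": [1, 5, 10],
    "max_leaf_nodes": [5, 10, 15],
}

search = RandomizedSearchCV(
    estimator=DecisionTreeClassifier(random_state=1),
    param_distributions=param_distributions,
    n_iter=10,                           # number of sampled parameter combinations
    scoring=make_scorer(recall_score),
    cv=5,
    random_state=1,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```

For the other models, only the `estimator` and `param_distributions` change; the training data passed to `fit` is the oversampled or undersampled set as appropriate.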

Tuning AdaBoost using oversampled data

Tuning AdaBoost using undersampled data

Overfitting is reduced.

Sample tuning method for Decision tree with oversampled data

The model is still overfitting. The validation recall and f1 are still lower than the training recall and f1 on the oversampled data.

Sample tuning method for Decision tree with undersampled data

The recall on the validation data has improved compared to the oversampled data.

Tuning Bagging classifier using oversampled data

The recall score on the validation set decreases compared to the decision tree model, but the model is still overfitting. The f1 score on the validation set increases slightly.

Tuning Bagging classifier using undersampled data

The overfitting is reduced in the undersampled data.

Tuning Random forest using oversampled data

The model is overfitting the training data.

Tuning Random forest using undersampled data

The overfitting is reduced.

Tuning Gradient Boosting using oversampled data

The overfitting is reduced compared to previous models.

Tuning Gradient Boosting using undersampled data

The model is performing better in the undersampled data.

Tuning XGBoost using oversampled data

XGBoost performs better than the previous models on the training data.

Tuning XGBoost using undersampled data

The overfitting is reduced.

We have now tuned all the models. Let's compare the performance of all tuned models and see which one is the best.

Model performance comparison and choosing the final model

We want recall to be maximized, i.e., false negatives (FN) minimized, so that fewer infected patients go undetected and the disease spreads less. We also want precision to be maximized, i.e., false positives (FP) minimized, which helps reduce deaths and frees hospital capacity. The decision tree model gives the highest recall score, but its precision is lower than that of the gradient boosting model. Gradient boosting performs well on both recall and precision, so we choose gradient boosting tuned with oversampled data as the final model.

Now that we have our final model, let's find out how it performs on unseen test data.

Let's use Pipelines to build the final model
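A sketch of what such a pipeline might look like, assuming a median imputer followed by a gradient boosting classifier. The step names, default hyperparameters, and synthetic data are placeholders; the tuned parameters from the project are not reproduced here.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.impute import SimpleImputer
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Synthetic data with missing entries, standing in for the real dataset.
X, y = make_classification(n_samples=400, n_features=8, random_state=1)
rng = np.random.default_rng(1)
X[rng.random(X.shape) < 0.2] = np.nan

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=1)

# Imputation and model travel together, so the test data is transformed
# with statistics learned from the training data only.
model = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("gbm", GradientBoostingClassifier(random_state=1)),
])
model.fit(X_train, y_train)

test_pred = model.predict(X_test)
test_recall = recall_score(y_test, test_pred)
print("test recall:", test_recall)
```

A plain scikit-learn `Pipeline` has no resampling step, so if the final model is trained on oversampled data, the oversampling would be applied to `X_train` and `y_train` before calling `fit`.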

Business Insights and Conclusions

  1. Our analysis shows that the gradient boosting model gives generalized performance with a high recall score, which is needed to minimize false negatives. We want to minimize false negatives because if we predict a negative test result for a patient who actually has covid, the patient's health will get worse and the disease will spread further.

  2. The model also achieves high precision, i.e., few false positives. We want to minimize false positives because if we predict a positive result for a patient who does not have covid, that patient takes up resources that real covid patients need. Hospital beds will then be unavailable for real patients, and the number of deaths will increase rapidly.

  3. Patient_age_quantile is the most important feature, followed by Rhinovirus/Enterovirus, Patient admitted to regular ward, Platelets and Leukocytes.

  4. This model can be used to predict whether a patient has covid. It will help identify and treat real covid patients, and it will also provide information on the availability of beds.

  5. Persons aged less than 19 have a lower chance of contracting covid. The Platelets and Leukocytes levels of regular ward patients are negative. When they turn positive, it indicates that the patient needs intensive care and special treatment.
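The feature ranking in point 3 is the kind of output obtained from a fitted model's `feature_importances_` attribute. A sketch on synthetic data, with placeholder feature names rather than the real dataset columns:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=300, n_features=5, n_informative=2,
                           n_redundant=0, random_state=1)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]   # placeholder names

gbm = GradientBoostingClassifier(random_state=1).fit(X, y)

# Impurity-based importances, normalised to sum to 1, sorted high to low.
importances = pd.Series(gbm.feature_importances_,
                        index=feature_names).sort_values(ascending=False)
print(importances)
```

Impurity-based importances rank features by how much they reduce the training loss in the trees; they do not say whether a feature pushes a prediction towards positive or negative.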